Abstract

We propose a batchwise monotone algorithm for dictionary learning. Unlike
state-of-the-art dictionary learning algorithms, which impose sparsity constraints
on a sample-by-sample basis, we treat the samples as a batch and
impose the sparsity constraint on the whole. The benefit of batchwise optimization
is that the non-zeros can be better allocated across the samples, leading to a better
approximation of the whole. To accomplish this, we propose procedures that switch
non-zeros in both rows and columns in the support of the coefficient matrix to reduce
the reconstruction error. We prove that, in the proposed support switching procedure,
the objective of the algorithm, i.e., the reconstruction error, decreases monotonically
and converges. Furthermore, we introduce a block orthogonal matching
pursuit algorithm that also operates on sample batches to provide a warm start.
Experiments on both natural image patches and UCI data sets show that the proposed
algorithm produces a better approximation at the same sparsity levels compared
to the state-of-the-art algorithms.


1 Introduction 

A number of algorithms have recently been developed that automatically design
representations through a process called dictionary learning. The hope is that learning
algorithms can exploit structure in specific classes of signals, enabling better performance
in applications. Dictionary learning algorithms have already been used successfully
in a number of image processing problems, such as image compression, inpainting,
image denoising, super-resolution, digit recognition, and texture classification.

Popular dictionary learning algorithms can be roughly divided into two categories:
hard-constraint-based and soft sparsity-penalty-based. These algorithms
search for a dictionary of vectors (called atoms) so that it is possible to represent
each sample signal as a linear combination of a small number of the atoms. Often
dictionaries with more atoms than the dimension of the signals, called over-complete
dictionaries, are used.

Since we would like to use only a few atoms in the representation of each sample,
a sparsity constraint is imposed on the coefficients in the representations. Both
K-SVD [8] and the online dictionary learning algorithm [10] impose sparsity constraints,
either hard or soft, on the representation of individual samples. However, it may not
be optimal to assume a similar sparsity level for each sample. In fact, some samples
may be easy to represent, while others may require more atoms in their representations.
The recent dictionary learning algorithm of [11] searches for dictionaries that have a
sparsity constraint on the number of times each atom is used. Thus, some signals can
be represented using more atoms than others. Their algorithm inspires us to focus on
how individual atoms are used rather than how individual signals are represented.

In this paper we present a monotone algorithm for dictionary learning. Similar to
[11], the algorithm we propose in this work also acts on the rows of the coefficient
matrix, but can empirically produce good approximations even in more challenging
(and realistic) conditions. In contrast to the traditional sample-based sparsity
constraint, we impose the sparsity constraint in a batchwise fashion. That is, we switch
the positions of the non-zeros in the coefficients within batches of samples among
different columns and rows in the coefficient matrix, and at the same time keep the total
number of non-zeros fixed. As a result, the number of non-zeros, constrained within a
batch of samples, is allowed to vary in either columns or rows. We show that all the
non-zero position switching operations only reduce reconstruction error, leading to a
convergent objective function. For initialization, we introduce a simple iterative
dictionary update procedure that operates on a batch of samples to give an approximate guess
of the dictionary. In each iteration, first the non-zero patterns are derived using a block
orthogonal matching pursuit and then the dictionary is updated using least squares.

There are two main advantages of our proposed algorithm: 

1. Since the non-zero positions are optimized in a batchwise fashion, we are able to 
achieve a smaller reconstruction error, or better approximation, compared to the 
traditional sample-by-sample constraint with the same level of sparsity. 

2. The reconstruction error is guaranteed to decrease monotonically and converge. 


2 Notation 

In the dictionary learning problem, one is given a matrix that contains the sample
signals in its columns, $Y = [y_1, y_2, \dots, y_p] \in \mathbb{R}^{m \times p}$, along with a target number of
atoms, $n$. The goal is to find a dictionary of atoms $A \in \mathbb{R}^{m \times n}$ and a sparse coefficient
matrix $X \in \mathbb{R}^{n \times p}$ so that $Y \approx AX$.

Throughout this paper, $m$ is the dimension of each sample and $p$ is the number of
samples. We use $a_i$ and $x_i$ to denote the $i$th column of $A$ and $X$ respectively, and $x^i$
to denote the $i$th row of $X$. We use $\Omega^i$ to denote the support of $x^i$, $\Omega_i$ for the support
of $x_i$, $k^i = |\Omega^i|$, and $k_i = |\Omega_i|$. $\otimes$ is the Kronecker product, $I_p$ is a $p$-by-$p$ identity
matrix, and $y = \mathrm{vec}(Y)$, where $\mathrm{vec}(\cdot)$ concatenates the columns of $Y$ to form a vector.

For a set $\Omega \subset \{1, \dots, p\}$, we let $P_\Omega$ denote the projection matrix
$P_\Omega = [e_{\omega_1}, e_{\omega_2}, \dots, e_{\omega_{|\Omega|}}]$, where $e_i$ is the elementary unit vector in
coordinate $i$. We use $Y_\Omega$ to denote $Y P_\Omega$. $A \setminus a_i$ means the sub-matrix constructed
by removing the $i$th column of $A$, and $X \setminus x^i$ is the sub-matrix constructed by removing
the $i$th row of $X$. When describing iterative algorithms, we use $x^{(+)}$ to denote the
updated value of $x$.

3 Columnwise Sparsity Constraints 

State-of-the-art dictionary learning algorithms treat the input sample matrix $Y$ in a
sample-by-sample way. That is, the sparseness constraint is imposed on the coefficients
of each sample independently using the current dictionary, and then the dictionary
and coefficient matrix are updated accordingly. Among popular dictionary learning
algorithms, two representative families are the hard-constraint-based K-SVD [8] and the
soft-penalty-based dictionary learning algorithms, such as the online learning algorithm
[10] and the efficient sparse coding algorithms.

The K-SVD algorithm aims to iteratively minimize the objective

$$\min_{A,X} \|Y - AX\|_F^2 \quad \text{s.t.} \quad \forall i,\ \|x_i\|_0 \le k, \tag{1}$$

where $A$ is the dictionary and $X$ is the sparse coefficient matrix.

Empirically, the K-SVD algorithm often works well, but its objective value is not
guaranteed to decrease monotonically because the support of $X$ is changed using the
greedy pursuit algorithm one column at a time.

The soft-penalty-based algorithms replace the hard $\ell_0$ penalty with an $\ell_1$ penalty, giving

$$\min_{A,X} \|Y - AX\|_F^2 + \lambda \sum_i \|x_i\|_1. \tag{2}$$

This optimization problem is not convex. However, it is convex in $X$ if $A$ is fixed, or
in $A$ if $X$ is fixed. As in the method of optimal directions (MOD), the optimization is
done via alternating directions.

The advantage of column-wise sparsity constraints is that they lead naturally to fast
online algorithms. Whenever a new sample arrives, one simply adds a sparsity
constraint on the incoming column of $X$. The downside is that the sample-by-sample sparsity
treatment lacks a "global" view of the sparsity pattern. For example, there is no reason
we should require each sample to be represented by exactly $k$ atoms in the dictionary,
or impose a sparsity penalty with the same $\lambda$. Some samples, being "harder" to
approximate, require more atoms, and it could be a waste to use too many atoms to represent
the "easy" samples.
Algorithm 1 Inner Row Support Switching

Input: $Y \in \mathbb{R}^{m \times p}$, $x \in \mathbb{R}^{1 \times p}$, $a \in \mathbb{R}^{m \times 1}$, and $N$.

Output: $a^{(+)} \in \mathbb{R}^{m \times 1}$, and $x^{(+)} \in \mathbb{R}^{1 \times p}$.

Denote the support of $x$ as $\Omega$, and $k = |\Omega|$.

for $r = 1$ to $N$ do

    SVD: $Y_\Omega = \sigma_1 a x_\Omega + \sum_{1 < j \le m} \sigma_j u_j v_j^T$, s.t. $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_m$, and $\|a\|_2 = 1$.
    Use the indices of the $k$ largest $\{|a^T y_i|\}$ as $\Omega$.
    $x_\Omega = a^T Y_\Omega / \|a\|_2^2$, $x_{\Omega^c} = 0$.

end for

$a^{(+)} = a$, $x^{(+)} = x$.


4 Batchwise Support Switching Procedures 

Unlike K-SVD and online learning, which constrain the sparsity of the coefficients in
a column-by-column fashion, we argue that it may be possible to obtain better sparse
approximations of the input as a whole if we allow the column sparsity to vary. We
seek the best possible reconstruction, subject to a constraint on the total number of
nonzeros across the batch of samples:

$$\min_{A,X} \|Y - AX\|_F^2 \quad \text{s.t.} \quad \|X\|_0 \le K. \tag{3}$$

The advantage is that non-zero positions with less impact on the objective can
be replaced by more crucial ones. As a result, a more accurate decomposition is produced,
with different column sparsities across samples.

We introduce a heuristic for attacking this problem, which computes an initial
sparsifying dictionary using alternating directions, and then refines it using a sequence of
support and amplitude adjustments. The initial approximation makes use of a batchwise
orthogonal matching pursuit, which aims at minimizing the $\ell_0$ norm of $X$ as a
whole.
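
To make the benefit concrete, the following minimal sketch compares the two ways of allocating non-zeros in the simplest possible setting. It assumes, purely for illustration, that the dictionary is the identity ($A = I$), so the best sparse approximation simply keeps the largest-magnitude entries of $Y$. The function names `columnwise_error` and `batchwise_error` and the toy matrix are hypothetical and are not taken from the paper.

```python
def columnwise_error(Y, k):
    """Squared error when every column keeps its k largest-magnitude entries (A = I)."""
    total = 0.0
    for col in zip(*Y):                                   # iterate over the columns of Y
        energies = sorted((v * v for v in col), reverse=True)
        total += sum(energies[k:])                        # energy of the discarded entries
    return total


def batchwise_error(Y, K):
    """Squared error when the whole batch keeps its K largest-magnitude entries (A = I)."""
    energies = sorted((v * v for row in Y for v in row), reverse=True)
    return sum(energies[K:])


# Toy batch (rows of Y listed): column 0 is "hard" (three large entries),
# columns 1 and 2 are "easy" (one significant entry each).
Y = [[3.0, 2.0, 0.1],
     [3.0, 0.1, 0.1],
     [3.0, 0.1, 2.0]]
p, k = 3, 1

err_col = columnwise_error(Y, k)          # one non-zero per sample
err_batch = batchwise_error(Y, k * p)     # same total budget, shared across the batch
print(err_col, err_batch)
```

With the same total budget of three non-zeros, the batchwise allocation spends all of them on the hard sample, which is why its error can never exceed the columnwise error in this identity-dictionary setting.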

The support switching procedure updates the non-zero positions, i.e., the sparsity
patterns, in the coefficient matrix $X$ in two ways: inner-row switching and inter-row
switching. In the inner-row switching, the total number of non-zeros in each row is
fixed, and the non-zero positions are adjusted within the same row. In the inter-row
switching, the total number of non-zeros in a pair of rows is fixed, and the non-zeros are
exchanged between the two rows. Finally, we introduce an iterative procedure to adjust the
amplitudes of the coefficients and the dictionary, with the sparsity pattern fixed. The whole
algorithm is described in Algorithm 4. We prove that in the procedures of sparsity
pattern switching and amplitude adjustment, the objective decreases monotonically
and converges.
Algorithm 2 Inter Row Support Switching

Input: $Y \in \mathbb{R}^{m \times p}$, $x^i, x^j \in \mathbb{R}^{1 \times p}$, and $a_i, a_j \in \mathbb{R}^{m \times 1}$.

Output: $x^{i(+)}, x^{j(+)} \in \mathbb{R}^{1 \times p}$.

1. Denote the supports of $x^i$ and $x^j$ as $\Omega^i$ and $\Omega^j$ respectively, and
$\Omega = (\Omega^i \cup \Omega^j) \setminus (\Omega^i \cap \Omega^j)$. Denote $\bar{\Omega} = (\Omega^i \cap \Omega^j)^c$.

2. Form the matrix $M = [a_i, a_j]^T Y_{\bar{\Omega}}$.

3. Pick the larger entry of the two rows in each column of $|M|$ as candidates, and
use the positions of the largest $|\Omega|$ candidates as $\Omega^{(+)}$.

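4. Set $X_{\Omega^{(+)}} = M_{\Omega^{(+)}}$, set the other entries of $X$ in the columns $\bar{\Omega}$ to $0$, and leave the columns of $X$ in $\Omega^i \cap \Omega^j$ unchanged.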

5. Return $x^{i(+)} = e_1^T X$ and $x^{j(+)} = e_2^T X$.


4.1 Inner-Row Support Switching 

Suppose we are given the number of non-zeros $k^i$ in each row of $X$. The problem
becomes

$$\min_{A,X} \|Y - AX\|_F^2 \quad \text{s.t.} \quad \|x^i\|_0 = k^i, \quad i = 1, \dots, n. \tag{4}$$

Note that here $x^i$ is the $i$-th row of $X$, whereas the K-SVD objective (1) constrains the $i$-th column.
Globally optimizing this objective is challenging, due to the nonconvexity of the
constraint set and the objective. Similar to K-SVD, we can obtain a simpler subproblem
by only considering one row at a time, giving

$$\min_{a_i, x^i} \Big\| Y - \sum_{j \neq i} a_j x^j - a_i x^i \Big\|_F^2 \quad \text{s.t.} \quad \|x^i\|_0 \le k^i. \tag{5}$$

Setting $\hat{Y} = Y - \sum_{j \neq i} a_j x^j$, the problem becomes one of finding a rank-one
approximation to $\hat{Y}$, with at most $k^i$ nonzero columns.

We attack this problem using alternating directions. Assuming the support $\Omega^i$ is
known and fixed, a best rank-one approximation $a_i x^i$ can be fit using the SVD of $\hat{Y}_{\Omega^i}$.
If, on the other hand, $a_i$ is fixed, an optimal support $\Omega^i$ can be derived by simply ranking
the absolute values of the projected samples, i.e., $|a_i^T \hat{y}_j|$. Once the support $\Omega^i$ and the
atom $a_i$ are known, $x^i$ can be calculated in closed form. The resulting algorithm is
listed in Algorithm 1.
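
The alternation described in this subsection can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: it assumes the residual is passed as a list of rows, and it replaces the SVD of the restricted residual with a simple power iteration, which only approximates the leading left singular vector. The names `inner_row_switch`, `rank_one_error`, `n_outer`, and `n_power` are hypothetical.

```python
import math


def rank_one_error(Y, a, x):
    """Squared Frobenius error ||Y - a x||_F^2 for a column vector a and a row vector x."""
    return sum((Y[r][j] - a[r] * x[j]) ** 2
               for r in range(len(Y)) for j in range(len(x)))


def inner_row_switch(Y, a, support, n_outer=5, n_power=50):
    """Alternate between refitting the atom on the support and re-picking the support.

    Y is the m x p residual (list of rows), a is the current nonzero atom (length m),
    support is a set of column indices whose size k is kept fixed throughout.
    """
    m, p, k = len(Y), len(Y[0]), len(support)
    cols = [[Y[r][j] for r in range(m)] for j in range(p)]
    dot = lambda u, v: sum(s * t for s, t in zip(u, v))
    for _outer in range(n_outer):
        # Atom step (assumption: power iteration on Y_S Y_S^T stands in for the SVD).
        for _step in range(n_power):
            a_new = [0.0] * m
            for j in support:
                w = dot(a, cols[j])
                for r in range(m):
                    a_new[r] += w * cols[j][r]
            norm = math.sqrt(dot(a_new, a_new))
            if norm == 0.0:
                break
            a = [v / norm for v in a_new]
        # Support step: keep the k columns with the largest |a^T y_j|.
        scores = [abs(dot(a, c)) for c in cols]
        support = set(sorted(range(p), key=lambda j: -scores[j])[:k])
    # Amplitude step: project the selected columns onto the atom, zero elsewhere.
    x = [dot(a, cols[j]) / dot(a, a) if j in support else 0.0 for j in range(p)]
    return a, x, support


# Toy residual: columns 1 and 3 share a direction, columns 0 and 2 are nearly zero.
Y = [[0.1, 4.0, 0.0, 3.0],
     [0.0, 4.0, 0.2, 3.0]]
a0, support0 = [1.0, 0.0], {0, 2}
x0 = [0.1, 0.0, 0.0, 0.0]            # projection of columns 0 and 2 onto a0
err0 = rank_one_error(Y, a0, x0)
a1, x1, support1 = inner_row_switch(Y, a0, support0)
err1 = rank_one_error(Y, a1, x1)
print(err0, err1, sorted(support1))
```

In this toy case the poorly chosen initial support is exchanged for the two columns that the atom can actually explain, while the number of non-zeros in the row stays at two.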


4.2 Inter-Row Support Switching 

In this section, we introduce a procedure to adjust the non-zeros between two rows
of $X$, such that the reconstruction error in (4) decreases and at the same time the total
number of non-zeros in the two rows stays the same. First, we define the unique columns
in $X_s = [x^i; x^j]$ to be the symmetric difference of the supports of $x^i$ and $x^j$:

Definition. Suppose the supports of $x^i$ and $x^j$ are $\Omega^i$ and $\Omega^j$ respectively. Then the index
set of the unique columns in $X_s$ is $\Omega = (\Omega^i \cup \Omega^j) \setminus (\Omega^i \cap \Omega^j)$.

If we fix the remaining columns of $A$ (and the corresponding rows of $X$), the residual is
$\hat{Y} = Y - [A \setminus \{a_i, a_j\}][X \setminus \{x^i, x^j\}]$.
Again, if we fix the dictionary atoms $a_i$ and $a_j$, with $\|a_i\| = \|a_j\| = 1$, the optimal
support for the $|\Omega|$ unique columns can be derived by ranking the absolute values of
the projected samples $M = [a_i, a_j]^T \hat{Y}_{(\Omega^i \cap \Omega^j)^c}$, with the constraint that we can only pick
one non-zero in each column of $M$. The procedure is described in Algorithm 2.
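
The selection rule above can be illustrated with a short plain-Python sketch. It is a hedged illustration, not the authors' code: it assumes unit-norm atoms, takes the residual as a list of rows, and (as an assumption, since the corresponding step of Algorithm 2 is incomplete in this copy) leaves the columns where both rows are already non-zero untouched. The name `inter_row_switch` is hypothetical.

```python
def inter_row_switch(Y, ai, aj, xi, xj):
    """Re-allocate the non-zeros of two coefficient rows using projections of the residual.

    Y is the m x p residual with both atoms removed (list of rows), ai and aj are
    unit-norm atoms, xi and xj are the two coefficient rows (length p).
    """
    m, p = len(Y), len(xi)
    si = {c for c in range(p) if xi[c] != 0.0}
    sj = {c for c in range(p) if xj[c] != 0.0}
    shared = si & sj                    # columns used by both rows (kept as they are)
    budget = len(si ^ sj)               # size of the symmetric difference
    candidates = []
    for c in range(p):
        if c in shared:
            continue
        mi = sum(ai[r] * Y[r][c] for r in range(m))
        mj = sum(aj[r] * Y[r][c] for r in range(m))
        # At most one non-zero per column: keep the row with the larger |projection|.
        if abs(mi) >= abs(mj):
            candidates.append((abs(mi), c, 0, mi))
        else:
            candidates.append((abs(mj), c, 1, mj))
    candidates.sort(key=lambda t: -t[0])
    new_i = [xi[c] if c in shared else 0.0 for c in range(p)]
    new_j = [xj[c] if c in shared else 0.0 for c in range(p)]
    for _score, c, row, value in candidates[:budget]:
        (new_i if row == 0 else new_j)[c] = value
    return new_i, new_j


# Toy example with orthonormal atoms: row j holds a weak non-zero in column 2,
# while column 1 would benefit much more from atom i.
Y = [[5.0, 4.0, 0.0, 0.0],
     [0.0, 0.0, 0.1, 3.0]]
ai, aj = [1.0, 0.0], [0.0, 1.0]
xi = [5.0, 0.0, 0.0, 0.0]
xj = [0.0, 0.0, 0.1, 0.0]
new_i, new_j = inter_row_switch(Y, ai, aj, xi, xj)
print(new_i, new_j)
```

Here the total of two non-zeros is preserved, but the weak non-zero of row $j$ migrates to row $i$, which reduces the squared error on this toy residual from 25 to 9.01.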

The inter-row support switching reduces the objective in (4) by fixing the dictionary
and comparing the importance of non-zero positions by ranking the projected
absolute values of the residuals $\hat{Y}$. If we run the procedure for all pairs of rows in $X$,
the total number of non-zeros in $X$ stays the same but their distribution is optimized
in a batchwise manner. The procedure is based on a fixed dictionary $A$, and the optimization is
only carried out on rows of $X$. Switching supports between all pairs of rows can be
expensive when the number of rows $n$ in $X$ grows large. In practice, we use two
ways to reduce the computation cost:

1. Instead of going over all pairs of rows in $X$, we only go through a randomly
sampled subset of the $\binom{n}{2}$ pairs.

2. The inter-row support switching is only carried out when the objective decreases
very slowly in the inner-row support switching.

The inner-row and inter-row support switching interchange the positions of the
non-zeros in the coefficient matrix within a batch of samples. In the following section we
will introduce a procedure to further reduce the objective by changing the amplitude of
the entries in both $A$ and $X$ given the support of $X$.


5 Alternating Amplitude Adjustment 


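Algorithm 3 Alternating Amplitude Adjustment

Input: $Y \in \mathbb{R}^{m \times p}$, $X \in \mathbb{R}^{n \times p}$, $A \in \mathbb{R}^{m \times n}$, and $N$.

Output: $X^{(+)} \in \mathbb{R}^{n \times p}$, and $A^{(+)} \in \mathbb{R}^{m \times n}$.

Denote the support of $X$ as $\Omega$, and the support of $x_j$ as $\Omega_j$.

for $i = 1$ to $N$ do

    $A = Y X^T (X X^T)^{-1}$.

    for $j = 1$ to $p$ do

        $x_j(\Omega_j) = (A_{\Omega_j}^T A_{\Omega_j})^{-1} A_{\Omega_j}^T y_j$, $x_j(\Omega_j^c) = 0$

    end for

end for

$X^{(+)} = X$, $A^{(+)} = A$.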


In this section, we propose an alternating optimization algorithm for the following
problem:

$$\min_{A,X} \|Y - AX\|_F^2 \quad \text{s.t.} \quad X(\Omega^c) = 0. \tag{6}$$
Algorithm 4 BatchSVD

Input: $Y \in \mathbb{R}^{m \times p}$, $A \in \mathbb{R}^{m \times n}$, $X \in \mathbb{R}^{n \times p}$, $N_1$, $\epsilon$, $N_2$.

Output: $A^{(+)} \in \mathbb{R}^{m \times n}$, and $X^{(+)} \in \mathbb{R}^{n \times p}$.

Rearrange the columns of $A$ and the rows of $X$ such that $k^1 \ge k^2 \ge \dots \ge k^n$.

repeat

    for $i = 1$ to $N_1$ do

        Run Algorithm 1 with input $\hat{Y} = Y - [A \setminus a_i][X \setminus x^i]$, $x^i$, $a_i$, and $N$. Update
        $x^i$ and $a_i$.

    end for

    Rescale $A$ and $X$, such that $\forall i$, $\|a_i\|_2 = 1$.

    for all $\binom{n}{2}$ pairs of rows $\{i, j\}$ do

        Run Algorithm 2 with input $\hat{Y} = Y - [A \setminus \{a_i, a_j\}][X \setminus \{x^i, x^j\}]$, $x^i$, $x^j$, $a_i$, $a_j$.

        Update $x^i$ and $x^j$.

    end for

    Run Algorithm 3 with input $Y$, $A$, $X$, and $N_2$. Update $A^{(+)}$ and $X^{(+)}$.

until $\|Y - AX\|_F^2 - \|Y - A^{(+)} X^{(+)}\|_F^2 < \epsilon$


If $X$ is known and fixed, the optimal $A$ can be computed via least squares:
$A = Y X^T (X X^T)^{-1}$.

On the other hand, given $A$ and $\Omega$, the objective above decomposes into a sum
of samplewise reconstruction errors:
$\min_{X_\Omega} \|Y - AX\|_F^2 = \min_{x_j(\Omega_j)} \sum_j \|y_j - A_{\Omega_j} x_j(\Omega_j)\|_2^2$,
which amounts to solving a least squares problem for each column of $X$, sample by sample,
that is: $x_j(\Omega_j) = (A_{\Omega_j}^T A_{\Omega_j})^{-1} A_{\Omega_j}^T y_j$, and $x_j(\Omega_j^c) = 0$. The detailed
procedure is shown in Algorithm 3. It is not hard to see that in each iteration the
objective does not increase.
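
The two least-squares updates can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: matrices are lists of rows, $X X^T$ and each $A_{\Omega_j}^T A_{\Omega_j}$ are assumed nonsingular, and the normal equations are solved with a small hand-written Gaussian elimination. The names `solve`, `frob_error`, `update_dictionary`, and `update_coefficients` are hypothetical.

```python
def solve(G, b):
    """Solve G z = b by Gaussian elimination with partial pivoting (G square, nonsingular)."""
    n = len(G)
    M = [row[:] + [b[i]] for i, row in enumerate(G)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for t in range(c, n + 1):
                M[r][t] -= f * M[c][t]
    z = [0.0] * n
    for c in reversed(range(n)):
        z[c] = (M[c][n] - sum(M[c][t] * z[t] for t in range(c + 1, n))) / M[c][c]
    return z


def frob_error(Y, A, X):
    """Squared Frobenius error ||Y - A X||_F^2."""
    m, n, p = len(A), len(A[0]), len(X[0])
    return sum((Y[r][j] - sum(A[r][i] * X[i][j] for i in range(n))) ** 2
               for r in range(m) for j in range(p))


def update_dictionary(Y, X):
    """A = Y X^T (X X^T)^{-1}, computed one row of A at a time."""
    n, p = len(X), len(X[0])
    G = [[sum(X[u][j] * X[v][j] for j in range(p)) for v in range(n)] for u in range(n)]
    return [solve(G, [sum(yr[j] * X[u][j] for j in range(p)) for u in range(n)]) for yr in Y]


def update_coefficients(Y, A, X):
    """Refit every column of X on its current support by least squares; zeros stay zero."""
    m, n, p = len(A), len(A[0]), len(X[0])
    for j in range(p):
        S = [i for i in range(n) if X[i][j] != 0.0]
        if not S:
            continue
        G = [[sum(A[r][u] * A[r][v] for r in range(m)) for v in S] for u in S]
        b = [sum(A[r][u] * Y[r][j] for r in range(m)) for u in S]
        for u, val in zip(S, solve(G, b)):
            X[u][j] = val
    return X


# Toy problem: the support of X is fixed, only the amplitudes of A and X change.
Y = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0]]
A = [[1.0, 0.5], [0.0, 1.0]]
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
errors = [frob_error(Y, A, X)]
for _ in range(3):
    A = update_dictionary(Y, X)
    errors.append(frob_error(Y, A, X))
    X = update_coefficients(Y, A, X)
    errors.append(frob_error(Y, A, X))
print(errors)
```

Because each half-step solves its least-squares subproblem exactly with the other block held fixed, the recorded errors form a non-increasing sequence, which mirrors the monotonicity claim in the text.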


6 Proof of Monotonicity 

The full procedure is described in Algorithm 4.¹ We will show in this section that
all the procedures introduced in Section 4 and Section 5 only decrease the objective
value, while keeping the total number of nonzero coefficients unchanged.

Lemma 6.1. The objective decreases monotonically in Algorithm 1.

Proof. Let $L(a_i, x^i, \Omega^i) = \|Y_i - a_i x^i(\Omega^i)\|_F^2$, where $Y_i = Y - \sum_{j \neq i} a_j x^j$. For
monotonicity, it suffices to show $L(a_i^{(+)}, x^{i(+)}, \Omega^{i(+)}) \le L(a_i, x^i, \Omega^i)$.

Since $L(a_i, x^i, \Omega^i) = \|Y_i - a_i x^i(\Omega^i)\|_F^2 = \|(Y_i)_{\Omega^i} - a_i x^i_{\Omega^i}\|_F^2 + \|(Y_i)_{(\Omega^i)^c}\|_F^2$,
the second term $\|(Y_i)_{(\Omega^i)^c}\|_F^2$ is fixed given $\Omega^i$. Since $\{a_i^{(+)}, x^{i(+)}\}$ minimizes
$\|(Y_i)_{\Omega^i} - a_i x^i_{\Omega^i}\|_F^2$ for any given $\Omega^i$, we have
$L(a_i^{(+)}, x^{i(+)}, \Omega^i) \le L(a_i, x^i, \Omega^i)$.


¹ Ways to reduce the computation cost are discussed in Section 4.2.
Algorithm 5 Dictionary Approximation using Block OMP

Input: $Y \in \mathbb{R}^{m \times p}$, $A_0$, $N$, $T$.

Output: $A \in \mathbb{R}^{m \times n}$ and $X \in \mathbb{R}^{n \times p}$.

Initialize $X = 0$, and $A = A_0$.

for $t = 1$ to $T$ do

    (1) $X \leftarrow \mathrm{OMP}(\mathrm{vec}(Y), I_p \otimes A, N)$.

    (2) $A \leftarrow \arg\min_A \|Y - AX\|_F^2$.

end for


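In the second step, $a_i^{(+)}$ is fixed, and w.l.o.g. let us assume $\|a_i^{(+)}\|_2 = 1$. Writing $y_j$
for the $j$th column of $Y_i$, we would like to solve

$$\begin{aligned}
\min_{x^i, \Omega^i} \|Y_i - a_i^{(+)} x^i\|_F^2
&= \min_{\Omega^i} \min_{x^i} \sum_{j \in \Omega^i} \|y_j - a_i^{(+)} x^i(j)\|_2^2 + \sum_{j \notin \Omega^i} \|y_j\|_2^2 \\
&= \min_{\Omega^i} \sum_{j \in \Omega^i} \min_{x^i(j)} \|y_j - a_i^{(+)} x^i(j)\|_2^2 + \sum_{j \notin \Omega^i} \|y_j\|_2^2 \\
&= \min_{\Omega^i} \sum_{j \in \Omega^i} \left( \|y_j\|_2^2 - (a_i^{(+)T} y_j)^2 \right) + \sum_{j \notin \Omega^i} \|y_j\|_2^2 \\
&= \sum_{j} \|y_j\|_2^2 - \max_{\Omega^i} \sum_{j \in \Omega^i} (a_i^{(+)T} y_j)^2 \\
&= \|Y_i\|_F^2 - \max_{\Omega^i} \|a_i^{(+)T} (Y_i)_{\Omega^i}\|_2^2.
\end{aligned} \tag{7}$$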

If we would like to choose $k^i$ non-zeros in the $i$th row of $X$, then the optimal way
of minimizing the objective is to choose the ones with the largest $|a_i^{(+)T} y_j|$. Thus the support
$\Omega^{i(+)}$ corresponding to the $k^i$ entries with the largest $|a_i^{(+)T} y_j|$ minimizes the reconstruction
error, and the corresponding $x^i$ is determined by the projection of $(Y_i)_{\Omega^i}$ onto $a_i^{(+)}$. □

In a similar way, one can prove that the objective decreases monotonically in Algorithm 2.
The proof is omitted here.

Convergence of the Objective: Since the objective values generated by the algorithm
form a monotonically decreasing sequence of non-negative real numbers, the sequence
converges by the monotone convergence theorem.


7 Dictionary Initialization 

Although the proposed algorithm has a convergent objective, it is still a local method. A
natural question is how we should initialize the sparsity pattern of $X$. We use a simple
batchwise iterative procedure to generate the initialization of the dictionary for
Algorithm 4.

The initialization procedure is listed in Algorithm 5, where OMP is Orthogonal
Matching Pursuit, and $\mathrm{OMP}(\mathrm{vec}(Y), I_p \otimes A, N)$ treats the block of samples as a whole,
in contrast to the sample-by-sample approach of K-SVD. The input of Algorithm 5
includes the total number of non-zeros $N$ in the representation $X$ of a batch of samples
$Y$. There is no guarantee that the Dictionary Approximation algorithm converges, but
empirically it provides a good initialization for our BatchSVD algorithm.
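
Because $\mathrm{vec}(AX) = (I_p \otimes A)\,\mathrm{vec}(X)$, running OMP on the Kronecker-structured dictionary amounts to greedily choosing (atom, sample) pairs over the whole batch. The sketch below illustrates that idea in plain Python without ever forming $I_p \otimes A$. It is a hedged illustration under the assumptions of unit-norm atoms and nonsingular Gram matrices, not the authors' implementation; `block_omp` and `solve` are hypothetical names.

```python
def solve(G, b):
    """Solve G z = b by Gaussian elimination with partial pivoting (G square, nonsingular)."""
    n = len(G)
    M = [row[:] + [b[i]] for i, row in enumerate(G)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for t in range(c, n + 1):
                M[r][t] -= f * M[c][t]
    z = [0.0] * n
    for c in reversed(range(n)):
        z[c] = (M[c][n] - sum(M[c][t] * z[t] for t in range(c + 1, n))) / M[c][c]
    return z


def block_omp(Y, A, N):
    """Greedy batchwise pursuit: select at most N (atom, sample) pairs over the whole batch."""
    m, n, p = len(A), len(A[0]), len(Y[0])
    X = [[0.0] * p for _ in range(n)]
    supports = [[] for _ in range(p)]
    R = [row[:] for row in Y]                              # residual Y - A X
    for _ in range(N):
        # Greedy step: the unused (atom i, sample j) pair with the largest |a_i^T r_j|.
        best = max(((abs(sum(A[r][i] * R[r][j] for r in range(m))), i, j)
                    for j in range(p) for i in range(n) if i not in supports[j]),
                   default=None)
        if best is None or best[0] == 0.0:
            break
        _, i, j = best
        supports[j].append(i)
        S = supports[j]
        # Orthogonal step: refit sample j on its enlarged support by least squares.
        G = [[sum(A[r][u] * A[r][v] for r in range(m)) for v in S] for u in S]
        b = [sum(A[r][u] * Y[r][j] for r in range(m)) for u in S]
        for u, val in zip(S, solve(G, b)):
            X[u][j] = val
        for r in range(m):
            R[r][j] = Y[r][j] - sum(A[r][u] * X[u][j] for u in S)
    return X


# Toy batch with three unit-norm atoms; sample 0 is "hard", sample 2 is almost zero.
A = [[1.0, 0.0, 0.6],
     [0.0, 1.0, 0.8]]
Y = [[3.0, 0.0, 0.1],
     [1.0, 2.0, 0.0]]
X = block_omp(Y, A, 3)
nnz_per_sample = [sum(1 for i in range(3) if X[i][j] != 0.0) for j in range(3)]
print(nnz_per_sample)
```

With a total budget of three non-zeros, the pursuit spends two on the first sample, one on the second, and none on the nearly empty third sample, instead of forcing one non-zero per sample.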
8 Experiments


In this section, we compare our proposed approach to state-of-the-art dictionary
learning algorithms on real-world data sets, including natural image patches and general
machine learning sets. The focus of the experiments is data compression, which
is a critical application for dictionary learning. Particularly in the big data
regime, if the samples are represented using only a few coefficients, a great deal of storage space
can be saved. It may also be used in signal communication, where the sender and
receiver keep a copy of the dictionary, and only the sparse coefficients are transmitted.
In each experiment, we seek a dictionary $A$ and a sparse coefficient matrix $X$ that
minimize the reconstruction error $\|Y - AX\|_F$ with a given number of non-zeros in
$X$. The number of non-zeros is set to be the same as that produced by K-SVD and online
dictionary learning, and we compare the reconstruction errors.

8.1 Data Preparation 

We use 10 data sets in our experiments. The first one is the demonstration image set
provided in the K-SVD toolbox, with 5 images: Barbara, boat, house, Lenna, and
peppers. For each image we randomly sample 3000 overlapping patches of size 8-by-8
as the training set, and use another randomly sampled 3000 patches as the testing
samples in the open-set evaluation. The second data set is the Notre Dame image
library, which contains 715 images of the Notre Dame Cathedral in Paris. To
make the scales consistent, we resize each image to 512-by-512, and then randomly
sample 10,000 patches as the training set. In the testing stage, we randomly sample
3000 image patches from the image library for a total of 100 runs, and report the mean
and standard deviation of the reconstruction error. We also carry out experiments on 8
UCI data sets, including mnist, iris, yeast, glass, wine, ecoli, liver-disorder, and
heart-disease.²

8.2 Demonstration of the Dictionary 

For a better illustration we train a square dictionary using 10,000 randomly sampled
image patches of size 16-by-16 from the Notre Dame image data set, and thus the
dictionary $A$ has dimensions 256-by-256. The average number of non-zeros per sample is
$\|X\|_0 / p \approx 2.0223$, and $n = 256$. For the MNIST set, we randomly sample 3000
images from the training set, and learn a 784-by-100 dictionary. The average number
of non-zeros per sample is around 6.

8.3 Reconstruction on Natural Image Patches 

We compare our algorithm with online dictionary learning, K-SVD, and
overcomplete wavelets with orthogonal matching pursuit. The result using a Gaussian
random dictionary with OMP is also presented as a baseline. For the online dictionary
learning, we set $\lambda = 10$.³ Since the sparsity for the online dictionary learning is only

² We remove the sample columns with 'nan' entries in the heart-disease data set.

³ We use SPAMS with the default batch size 512 in our evaluation.
Figure 1: Demonstration of dictionary atoms learned using our algorithm. The left is
the learned dictionary from random patches in the Notre Dame library, and the right is
from the MNIST digit set.


Table 1: Reconstruction Errors on Natural Image Patches. The number outside the parentheses
is the average $L_2$ norm of the errors per patch, and the number inside the parentheses is the
standard deviation.


(a) Comparison with K-SVD

×10        online      KSVD        Wvlet       Rnd         Batch
barbara    2.9(0.8)    2.0(1.0)    3.1(2.5)    7.4(5.7)    1.8(0.4)
boat       3.2(0.6)    2.1(0.6)    3.0(1.9)    6.4(5.6)    1.9(0.3)
house      2.2(0.8)    1.7(0.9)    3.1(2.7)    6.8(8.3)    1.5(0.4)
lena       2.5(0.6)    1.8(0.7)    2.7(2.0)    6.1(6.2)    1.7(0.3)
peppers    2.9(0.6)    2.1(0.7)    3.0(1.8)    6.2(6.7)    2.0(0.3)
ND         4.3(2.4)    3.1(2.6)    4.0(3.8)    8.2(7.9)    2.7(1.4)


(b) Comparison with Error-based KSVD

           ESVD        KSVD        Wvlet       Rnd         Batch
barbara    2.5(0.4)    3.1(1.7)    5.0(4.1)    10.3(8.1)   2.4(0.6)
boat       2.7(0.3)    3.2(1.1)    4.7(3.3)    8.8(7.8)    2.6(0.4)
house      2.2(0.8)    2.7(1.9)    5.0(5.1)    8.6(10.4)   2.1(0.8)
lena       2.4(0.4)    3.0(1.8)    5.0(4.9)    8.3(8.7)    2.3(0.6)
peppers    2.6(0.3)    2.9(1.1)    4.3(3.4)    7.9(8.6)    2.5(0.5)
ND         2.4(0.8)    2.7(2.2)    3.4(3.2)    7.2(6.6)    2.1(1.0)
Table 2: Reconstruction Errors on the UCI Data Sets. The number outside the parentheses is
the average $L_2$ norm of the errors per sample, and the number inside the parentheses is the
standard deviation.


(a) Comparison with K-SVD

×10^{-3}   online       KSVD            Rnd            Batch
liver      22.6(4.0)    12(6.9)         51.9(35.6)     5.8(7.3)
iris       20.1(0.1)    522.1(275.9)    102.1(46.3)    11.1(12.4)
yeast      27.8(7.5)    8.1(8.6)        36.2(25.6)     7.9(8.5)
glass      20.1(0.2)    57.6(26.7)      68.5(35.4)     0.9(0.5)
wine       20.1(0.1)    64.8(30.9)      258.1(10.5)    1.6(0.7)
ecoli      27.7(3.5)    8.7(9.5)        50.4(37.3)     2.9(3.7)
heart      21.3(0.9)    173.6(103.5)    345.6(33.8)    8.2(2.7)


(b) Comparison with Error-based KSVD

           ESVD            KSVD            Rnd            Batch
liver      0.8(1.9)        7.4(6.5)        40.7(24.3)     0.6(1.3)
iris       491.0(309.7)    331.4(287.1)    152.1(55.4)    2.6(4.1)
yeast      2.0(2.9)        14.5(15.0)      37.2(27.4)     3.1(4.9)
glass      50.7(65.4)      45.7(24.3)      442.3(11.7)    3.1(23)
wine       80.0(45.7)      66.4(39.6)      797.2(1.5)     4.8(2.8)
ecoli      1.6(2.4)        16.7(17.9)      41.0(28.7)     2.5(3.5)
heart      172.7(82.4)     182.7(86.0)     334.4(22.6)    5.1(1.6)

softly constrained, we first run the online dictionary learning algorithm and then force
$k = \lfloor \|X_{\text{online}}\|_0 / p \rfloor$ in the K-SVD algorithm, such that the total number of non-zeros
in the representation derived using K-SVD is not larger than that of the online learning
algorithm. We then set the number of non-zeros in our algorithm to be exactly the
same as in the K-SVD algorithm. The iteration number of online learning and K-SVD
is set to 100. The iteration number for our algorithm is set to 20 with $N_1 = 3$ and
$N_2 = 10$. To accelerate the algorithm, the inter-row support switching is only carried
out when the objective decrement in the inner-row adjustment is smaller than 0.05. The
iteration number of the initialization procedure (Algorithm 5) is set to 80. For the image data sets,
the number of atoms in the dictionary $A$ is $n = 256$.
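
The sparsity-matching rule above is simple arithmetic, sketched here in plain Python. The function name `matched_sparsity` and the toy coefficient matrix are hypothetical and serve only to illustrate how the per-sample budget $k$ and the resulting total budget are derived from the online learning result.

```python
def matched_sparsity(X_online):
    """Return (total non-zeros, per-sample k, total K-SVD budget k * p) for a coefficient matrix."""
    p = len(X_online[0])                                   # number of samples (columns)
    total = sum(1 for row in X_online for v in row if v != 0.0)
    k = total // p                                         # floor(||X_online||_0 / p)
    return total, k, k * p


# Toy coefficient matrix with n = 3 atoms (rows) and p = 4 samples (columns).
X_online = [[0.5, 0.0, 0.0, 1.2],
            [0.0, 0.0, 0.3, 0.7],
            [0.1, 0.0, 0.0, 0.9]]
total, k, ksvd_budget = matched_sparsity(X_online)
print(total, k, ksvd_budget)
```

Because of the floor, the K-SVD budget `k * p` can be smaller than the number of non-zeros used by online learning, and the batchwise algorithm is then given that same, smaller total.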

8.4 Reconstruction on UCI Data Sets 

We also carried out experiments on the UCI data sets. For all the algorithms, we set
the number of atoms in the dictionary to $n = 30$. The data vectors are normalized to have
unit norm before being fed into the algorithms. Again we first run the online dictionary
learning algorithm with $\lambda = 0.02$, and then set $k = \lfloor \|X_{\text{online}}\|_0 / p \rfloor$, where $X_{\text{online}}$
is the coefficient matrix derived using the online learning algorithm. We set the same number
of non-zeros for our batch dictionary learning algorithm as that produced by K-SVD.
The reconstruction errors are listed in the first part of Table 1 and Table 2. We can
see that the batchwise algorithm works consistently better than the other methods.


8.5 Reconstruction-Error-Based K-SVD

We also compared with the reconstruction-error-based K-SVD (ESVD) algorithm
proposed in [13]. Since it is not easy to control exactly the number of non-zeros produced
by ESVD, we again first run ESVD and then set the number of non-zeros in our
algorithm to be exactly the same as in the ESVD algorithm. For comparison, the reconstruction
errors of the original K-SVD, wavelets, and random dictionary are also presented
with $k = \lfloor \|X_{\text{ESVD}}\|_0 / p \rfloor$. For the Notre Dame library we set the reconstruction error
$\epsilon = 30$, yielding an average sparsity $k \approx 9$ per sample. For the UCI data sets we set
$\epsilon = 0.01$. The results are presented in the second part of Table 1 and Table 2, from which
we observe that the ESVD algorithm performs reasonably better than the original
K-SVD algorithm. The batchwise algorithm, with the same sparsity level, gives better
approximations on all the sets except yeast and ecoli.

9 Conclusion 

In this paper we propose a monotone dictionary learning algorithm that is optimized for
sample batches. The reconstruction error is minimized by a series of support switching
procedures within the sample batch. We prove that the objective decreases monotonically
and converges in the support switching procedures. Using the proposed block orthogonal
matching pursuit algorithm as a warm start, the BatchSVD algorithm gives a better
approximation in terms of the reconstruction error at the same level of sparsity.